Hospitalizations and costs of ambulatory care sensitive conditions in Brazil

Exploratory and Data Analysis

Author

Alexsander R. Carvalho Junior

Published

November 25, 2025

Introduction

About the data

Ambulatory care sensitive conditions refer to health issues that can be effectively prevented, detected early, treated, and managed through primary health care (APS)1. In Brazil, the Ministry of Health developed its own list of hospitalizations related to these conditions, known as Hospitalizations for Primary Care Sensitive Conditions (ICSAP)2,3.High rates of ICSAP admissions often indicate serious issues with health system access or performance, reflecting potential gaps in preventive care, early diagnosis, or continuity of treatment1. Beyond the clinical implications, these hospitalizations carry a significant economic burden, with avoidable admissions representing billions in health care expenditures.

Objective

This exploratory data analysis aims to quantify the clinical and financial impact of ICSAP in the Brazilian SUS from 2014 to 2024, while assessing data quality and identifying inconsistencies to ensure robust insights on primary care gaps and potential cost savings.

Methodology

Hospital Information System (SIH), is an administrative database managed by the Ministry of Health, which also provides the official ICSAP list3. Macro-region data are sourced from the Brazilian Institute of Geography and Statistics (IBGE).

When processing the SIH-SUS data into the silver layer, hospitalizations whose primary diagnosis was related to obstetric causes or childbirth (ICD-10 codes O80–O84) were excluded4.

This document presents descriptive analyses of the percentage of ICSAP admissions and their costs, broken down annually by age group, gender, and Brazilian macro-regions. Results are initially presented for the period 2014–2024, followed by a more detailed look at data from 2024. Additionally, the most frequent ICSAP admissions are analyzed according to their primary diagnoses.

It is important to note that the values presented here may differ from other publications due to specific methodological choices made in this study

Data contains

The database consists of multiple datasets. Here will be explored data from the sih_sus table.

  1. id_hospitalization: Id number of the hospitalization.
  2. year: Year of hospitalization (yyyy).
  3. month: Month of the hospitalization (mm).
  4. state_code: State code of healthcare provider.
  5. length_stay: Number of days spent in hospital.
  6. gender: Subject’s gender.
  7. age: Subject’s age (years).
  8. race_color_code: Code for the Race/color of the subject.
  9. disease_code: Code of the principal diagnostic based in the ICD-10.
  10. hospitalization_type_code: Code of the type of hospitalization.
  11. complexity_code: Code for the hospitalization complexity.
  12. bed_speciality_code: Code of the hospitalization bed speciality.
  13. requested_procedure_code: Code for the requested procedure.
  14. total_cost: Total costs of the hospitalization (BR R$).
  15. death_flag: Indicate subject death.

Getting started

The first step is to import all the libraries required for the analysis.

import duckdb
from pathlib import Path
from IPython.display import display
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

Next, we establish a connection with the database, as the data is stored in a DuckDB file.

# Set the file path
base_dir = Path.cwd().parent.parent
duckdb_file = base_dir / "sih_sus.duckdb"

# Connect with the database
con = duckdb.connect(str(duckdb_file))

Then, inspect the dimmension of the sih_sus dataset, reviewing the first ten rows, column names and types, basic calculations and presence of null vales.

sih_10 = con.sql("""
SELECT * FROM silver.sih_sus LIMIT 10
""").df()

sih_info = con.sql("""
DESCRIBE silver.sih_sus
""").df()

sih_summa = con.sql("""
SUMMARIZE silver.sih_sus
""").df()

print("First ten roles of the SIH SUS dataset:")
display(sih_10)
print("-------------------------------------------------------------")

print("Table atributes:")
display(sih_info)
print("-------------------------------------------------------------")

print("SIH SUS data summarization:")
display(sih_summa)
First ten roles of the SIH SUS dataset:
id_hospitalization year month state_code age gender race_color_code disease_code hospitalization_type_code complexity_code bed_speciality_code requested_procedure_code length_stay total_cost death_flag
0 3513121070635 2014 2 35 64 Male 1 I500 2 2 03 0303060212 13 1260.71 0
1 3513121073913 2014 2 35 60 Female 1 I802 1 2 03 0303060298 17 548.97 0
2 3513121073968 2014 2 35 46 Male 1 C910 2 2 03 0304100021 7 587.90 1
3 3513121074012 2014 2 35 39 Male 1 J690 2 2 03 0303140151 8 1636.74 0
4 3513121074034 2014 2 35 75 Female 1 I615 2 2 03 0303040149 16 2126.65 1
5 3513121074045 2014 2 35 48 Female 1 T838 1 3 03 0506020045 1 138.36 0
6 3513121074067 2014 2 35 48 Female 1 N390 2 2 03 0303150033 0 298.21 0
7 3513121074089 2014 2 35 30 Male 1 J690 2 2 03 0303140151 23 7591.28 0
8 3513121074090 2014 2 35 48 Female 1 T838 2 3 03 0506020045 7 732.96 0
9 3513125119735 2014 2 35 61 Female 1 A499 1 2 03 0303010037 10 865.91 0
-------------------------------------------------------------
Table atributes:
column_name column_type null key default extra
0 id_hospitalization VARCHAR YES None None None
1 year SMALLINT YES None None None
2 month SMALLINT YES None None None
3 state_code SMALLINT YES None None None
4 age SMALLINT YES None None None
5 gender VARCHAR YES None None None
6 race_color_code SMALLINT YES None None None
7 disease_code VARCHAR YES None None None
8 hospitalization_type_code SMALLINT YES None None None
9 complexity_code SMALLINT YES None None None
10 bed_speciality_code VARCHAR YES None None None
11 requested_procedure_code VARCHAR YES None None None
12 length_stay SMALLINT YES None None None
13 total_cost DECIMAL(14,2) YES None None None
14 death_flag INTEGER YES None None None
-------------------------------------------------------------
SIH SUS data summarization:
column_name column_type min max approx_unique avg std q25 q50 q75 count null_percentage
0 id_hospitalization VARCHAR 1113100216734 9924300005444 94038206 None None None None None 120119838 0.0
1 year SMALLINT 2014 2024 12 2019.1870472885587 3.2104981687250445 2016 2019 2022 120119838 0.0
2 month SMALLINT 1 12 13 6.534843337034803 3.418173041111256 4 7 9 120119838 0.0
3 state_code SMALLINT 11 53 21 32.774509277976215 9.441504986848388 27 33 41 120119838 0.0
4 age SMALLINT 0 99 98 42.920546196540826 24.88953097272519 23 43 63 120119838 0.0
5 gender VARCHAR Female Male 2 None None None None None 120119838 0.0
6 race_color_code SMALLINT 1 99 6 20.721369595478556 38.16587152644139 1 3 3 120119838 0.0
7 disease_code VARCHAR A00 Z999 11377 None None None None None 120119838 0.0
8 hospitalization_type_code SMALLINT 1 6 6 1.799002376276931 0.5582217046601471 2 2 2 120119838 0.0
9 complexity_code SMALLINT 2 3 2 2.0828147470528555 0.2756020053551203 2 2 2 120119838 0.0
10 bed_speciality_code VARCHAR 01 Saúde Mental (Clínico) 19 None None None None None 120119838 0.0
11 requested_procedure_code VARCHAR 0201010038 0506020134 1685 None None None None None 120119838 0.0
12 length_stay SMALLINT 0 364 356 5.65983494749635 8.111488519371962 2 3 6 120119838 0.0
13 total_cost DECIMAL(14,2) 0.00 274207.48 2315519 1513.1952969215627 3805.741299234852 322 607 1135 120119838 0.0
14 death_flag INTEGER 0 1 2 0.051493725790739076 0.2210025393554081 0 0 0 120119838 0.0

The initial exploration reveals some key characteristcs of the dataset. SIH-SUS dataset contains 120,119,838 records across eleven years but with 26,081,632 non unique values for id_hospitalization code. This need to be checked!

Min and Max values for the year, lenght_stay and age columns suggests that SIH-SUS covers data from 2014 to 2024, that some individual were admitted and discharged at the same day (0 values) and that the data covers a range (0 to 99) of individuals from newborn to elderly. Also, the null_percentage is 0.0 for all columns. The maximum value of 364 days spent shows that there are very long-term hospitalizations in the sih_sus dataset, which is expected but worth verifying.

The Min value of 0 for total_cost is interesting but may not be an error because some scenarios can be recorded as having zero cost, as such:

  • If the main payment is registered on the first AIH, subsequent monthly records might appear with a cost of zero.
  • When a patient is transferred between hospitals
  • When certain procedures or administrative adjustments within the system might not have a direct cost associated with them but still generate a record.

Let’s have a look into some duplicated id_hospitalization

some_duplicates = con.sql("""
SELECT
    id_hospitalization,
    COUNT(*) AS total_duplicates
FROM silver.sih_sus
WHERE year = 2014 AND state_code = 35
GROUP BY ALL
HAVING COUNT(*) > 1
LIMIT 10
""").df()

some_duplicates
id_hospitalization total_duplicates
0 3513121299479 4
1 3514100438156 6
2 3514100434273 7
3 3514100430918 4
4 3514105710522 9
5 3513116876159 4
6 3512118708549 12
7 3513120582587 12
8 3510110855735 12
9 3508102554726 12

The ids vary in number of repetitions. We need to be sure they are not just an error. So let’s search for duplicates in Rio de Janeiro state at 2023, as an example, based on the id_hospitalization, month, age, gender, disease_code, length_stay and total_cost.

sih_dups = con.sql("""
SELECT
    id_hospitalization,
    month,
    age,
    gender,
    disease_code,
    length_stay,
    total_cost
FROM silver.sih_sus
WHERE year = 2023 AND state_code = 33
QUALIFY COUNT(*) OVER (PARTITION BY id_hospitalization) > 1
ORDER BY id_hospitalization, age DESC, total_cost DESC
LIMIT 20
""").df()

sih_dups
id_hospitalization month age gender disease_code length_stay total_cost
0 3308100738316 5 57 Female F200 31 1196.35
1 3308100738316 1 57 Female F200 31 1196.35
2 3308100738316 7 57 Female F200 31 1196.35
3 3308100738316 3 57 Female F200 31 1196.35
4 3308100738316 6 57 Female F200 30 1157.76
5 3308100738316 4 57 Female F200 30 1157.76
6 3308100738316 2 57 Female F200 28 1080.58
7 3308100738350 1 66 Female F205 31 1196.35
8 3308100738350 7 66 Female F205 31 1196.35
9 3308100738350 5 66 Female F205 31 1196.35
10 3308100738350 3 66 Female F205 31 1196.35
11 3308100738350 4 66 Female F205 30 1157.76
12 3308100738350 6 66 Female F205 30 1157.76
13 3308100738350 2 66 Female F205 28 1080.58
14 3308100738437 7 59 Male F205 31 1196.35
15 3308100738437 10 59 Male F205 31 1196.35
16 3308100738437 12 59 Male F205 31 1196.35
17 3308100738437 8 59 Male F205 31 1196.35
18 3308100738437 6 59 Male F205 30 1157.76
19 3308100738437 9 59 Male F205 30 1157.76

Apparently is not an error because they happen all in different months. We can look if the same id_hospitalization repeats over the years and if it has the same pattern.

sih_dups_years = con.sql("""
SELECT
    id_hospitalization,
    year,
    month,
    age,
    gender,
    disease_code,
    length_stay,
    total_cost
FROM silver.sih_sus
WHERE state_code = 33 AND id_hospitalization = '3308100738350' 
ORDER BY id_hospitalization, year, month 
""").df()
display(sih_dups_years.head(20))

sih_dups_years['year'].value_counts().sort_index()
id_hospitalization year month age gender disease_code length_stay total_cost
0 3308100738350 2014 1 57 Female F205 31 1196.35
1 3308100738350 2014 3 57 Female F205 31 1196.35
2 3308100738350 2014 4 57 Female F205 30 1157.76
3 3308100738350 2014 5 57 Female F205 31 1196.35
4 3308100738350 2014 6 57 Female F205 30 1157.76
5 3308100738350 2014 7 57 Female F205 31 1196.35
6 3308100738350 2015 3 58 Female F205 31 1157.76
7 3308100738350 2015 4 58 Female F205 30 1157.76
8 3308100738350 2015 5 58 Female F205 31 1196.35
9 3308100738350 2015 6 58 Female F205 30 1157.76
10 3308100738350 2015 7 58 Female F205 31 1157.76
11 3308100738350 2015 8 58 Female F205 31 1196.35
12 3308100738350 2015 10 58 Female F205 31 1196.35
13 3308100738350 2015 11 59 Female F205 30 1157.76
14 3308100738350 2015 12 59 Female F205 31 1196.35
15 3308100738350 2016 3 59 Female F205 31 1196.35
16 3308100738350 2016 4 59 Female F205 30 1157.76
17 3308100738350 2016 5 59 Female F205 31 1196.35
18 3308100738350 2016 6 59 Female F205 30 1157.76
19 3308100738350 2016 7 59 Female F205 31 1196.35
year
2014     6
2015     9
2016    11
2017    13
2018    12
2019    12
2020    12
2021    12
2022    12
2023     7
Name: count, dtype: int64

It looks like the same patern repeat over the years, therefore can be a continuos treatment/hospitalization.

But, if more than one identical record per month happens it is an error! Let’s check.

some_real_dups = con.sql("""
SELECT
    id_hospitalization,
    month,
    disease_code,
    length_stay,
    total_cost,
    COUNT(*) AS total_duplicates
FROM silver.sih_sus
WHERE year = 2023 AND state_code = 33
GROUP BY ALL
HAVING COUNT(*) > 1
ORDER BY month
LIMIT 10
""").df()

some_real_dups
id_hospitalization month disease_code length_stay total_cost total_duplicates
0 3323101961696 4 F200 28 1341.93 2
1 3323107430852 8 F312 31 2774.79 2
2 3323103325949 8 F70 31 2554.39 2
3 3321102499638 8 F200 31 2554.39 2
4 3323105680037 8 F200 31 1491.04 2
5 3323107430885 8 F200 31 2774.79 2
6 3323105067766 8 F200 31 2554.39 2
7 3322109138117 8 F200 31 2554.39 2
8 3323103424839 8 F29 31 2554.39 2
9 3322109100740 8 F21 31 2554.39 2

Yes, there are some identical records per month, so they need to be filtered out before the analysis.

Now let’s have a look for the number of records with a length of 364 days and the ones with zero cost

sih_364_stay = con.sql("""
WITH unique_obs AS (
    SELECT DISTINCT
        id_hospitalization,
        gender,
        age,
        state_code,
        month,
        disease_code,
        length_stay,
        total_cost
    FROM silver.sih_sus
)
SELECT *
FROM unique_obs
WHERE length_stay = 364;
""").df()

sih_zero_cost = con.sql("""
WITH unique_obs AS (
    SELECT DISTINCT
        id_hospitalization,
        year,
        month,
        disease_code,
        length_stay,
        total_cost
    FROM silver.sih_sus
)
SELECT *
FROM unique_obs
WHERE total_cost = 0
""").df()

print(f"There are {sih_364_stay.shape[0]} observations recorded with 364 days of stay.")
print(f"There are {sih_zero_cost.shape[0]} observations recorded with zero cost.")

display(sih_364_stay)
There are 12 observations recorded with 364 days of stay.
There are 280156 observations recorded with zero cost.
id_hospitalization gender age state_code month disease_code length_stay total_cost
0 3316104693143 Male 10 33 11 J849 364 14935.84
1 3318100547586 Female 55 33 8 K808 364 695.77
2 3319109926638 Female 1 33 8 Q038 364 8762.22
3 3314103512676 Male 7 33 1 A419 364 4781.71
4 2322101258829 Male 30 23 3 K928 364 347.15
5 3316101141441 Male 5 33 1 N47 364 219.12
6 3321108502350 Male 2 33 2 A415 364 5435.85
7 3319109861463 Male 1 33 12 Q743 364 3632.71
8 3118128823865 Female 61 31 1 A418 364 2166.80
9 3316106534906 Male 9 33 11 P271 364 103868.75
10 3515119823435 Male 1 35 11 Q210 364 22481.91
11 1123101678700 Male 4 11 2 G120 364 6453.57

It seems that hospitalizations recorded with a length of 364 days of stay are valid, since they are distributed across different gender, ages, month and brazilian state.

However, there are many entries recorded with cost equal to zero. I do not have enough information about the system to know if these records are valid or an error or anything else, so I will filter them out for further analysis.

Create a table without duplicates into the database.

con.execute("""
CREATE OR REPLACE TABLE silver.sih_sus_eda_clean AS
SELECT DISTINCT
    id_hospitalization,
    year,
    month,
    state_code,
    gender,
    race_color_code,
    age,
    disease_code,
    length_stay,
    total_cost
FROM silver.sih_sus
WHERE total_cost > 0
""")

Exploratory analysis

The next step is to explore and analyze ICSAP profiles and costs within SIH-SUS by join with the cid10_icsap table.

ICSAP Metrics By Year (2014 to 2024)

How have primary care-sensitive hospitalizations and costs evolved over the last ten years?

# Query the results
# Counting totals and hosp costs 
sih_sus_year = con.sql("""
SELECT
    year,
    COUNT(*) AS total_hospitalization,
    COUNT(icsap.avoidable_disease_code) AS total_icsap,
    ROUND(COUNT(icsap.avoidable_disease_code) * 100.0 / COUNT(*), 2) AS rate_icsap,

    SUM(CASE
            WHEN icsap.avoidable_disease_code IS NOT NULL THEN total_cost
        END) AS icsap_cost
FROM silver.sih_sus_eda_clean
LEFT JOIN silver.cid10_icsap AS icsap
    ON disease_code = icsap.avoidable_disease_code
GROUP BY year
ORDER BY year
""").to_df()
Code
# Currency formatter function
def format_brl(value):
    if value >= 1e9:
        return f"R${value / 1e9:.2f}B"
    elif value >= 1e6:
        return f"R${value / 1e6:.1f}M"
    else:
        return f"R${value:,.0f}"
Code
# Create plot
sns.set_theme(style="white", palette="colorblind")
# Bar plot
fig, ax = plt.subplots()

sns.barplot(
    data=sih_sus_year,
    x="year", y="rate_icsap",
    ax=ax
)
ax.bar_label(ax.containers[0], fmt='%.1f%%', fontsize=9, color='black', fontweight='bold')
ax.spines[['right', 'top', 'left']].set_visible(False)
ax.yaxis.set_visible(False)
ax.set_xlabel("")
plt.show()

# Horizontal bar plot
sih_sus_costs_year = sih_sus_year.sort_values(by='year', ascending=False).reset_index()
year_order = sih_sus_costs_year['year']

fig, ax = plt.subplots()

sns.barplot(
    data=sih_sus_costs_year,
    x="icsap_cost", y="year",
    orient='h',
    ax=ax, order=year_order
)

# Format values for bar labels
formatted_values = [format_brl(v) for v in sih_sus_costs_year["icsap_cost"]]

ax.bar_label(ax.containers[0], labels=formatted_values, padding=-50, fontsize=9, color='white', fontweight='bold')
ax.spines[['right', 'top', 'bottom']].set_visible(False)
ax.xaxis.set_visible(False)
ax.set_ylabel("")
plt.show()
(a) Percent of ICSAP Over Total Hospitalizations by year
(b) ICSAP Total Cost by Year
Figure 1: Hospitalizations and Cost by Year

From 2014 to 2024, Brazil experienced a gradual decline in the proportion of avoidable hospitalizations (ICSAP) relative to total hospitalizations, dropping from 10.6% in 2014 to 8.4% in 2024. The absolute number of ICSAP cases remained relatively stable over the years.

In terms of financial impact, ICSAP-related costs increased significantly, reaching over R$1.2 billion in 2024.

ICSAP Metrics By year and Brazilian Macro-regions

To add a regional perspective, we join the dataset with the uf_localidade table, which contains information about Brazil’s macro-regions.

sih_sus_regional = con.sql("""
SELECT
    sih.year,
    uf.big_region_name,
    COUNT(*) AS total_hospitalizations,
    COUNT(icsap.avoidable_disease_code) AS total_icsap,
    ROUND(COUNT(icsap.avoidable_disease_code) * 100.0 / COUNT(*), 2) AS rate_icsap,

    ROUND(SUM(sih.total_cost)) AS total_cost,
    ROUND(SUM(CASE 
                  WHEN icsap.avoidable_disease_code IS NOT NULL THEN sih.total_cost
                  ELSE 0 
              END), 2) AS icsap_cost,
    ROUND(SUM(CASE 
                  WHEN icsap.avoidable_disease_code IS NOT NULL THEN sih.total_cost
                  ELSE 0 
              END) * 100.0 / SUM(sih.total_cost), 2) AS rate_icsap_cost 

FROM silver.sih_sus_eda_clean AS sih
LEFT JOIN silver.cid10_icsap AS icsap
    ON sih.disease_code = icsap.avoidable_disease_code
LEFT JOIN silver.uf_localidade AS uf
    ON sih.state_code = uf.state_code
GROUP BY sih.year, uf.big_region_name
ORDER BY sih.year, uf.big_region_name
""").to_df()
Code
# Create plot
sns.set_theme(style="whitegrid", palette="colorblind")

# Bar plot
fig, ax = plt.subplots(figsize=(10, 6))

sns.barplot(
    data=sih_sus_regional,
    x="year",
    y="rate_icsap",
    hue="big_region_name", 
    ax=ax
)
ax.spines[['right', 'top']].set_visible(False)
ax.set_ylabel("ICSAP (%)")
ax.set_xlabel("")
ax.legend(title=None, loc="upper center", bbox_to_anchor=(0.5, 1.15), ncol=5)
plt.tight_layout()
plt.show()

# Line Plot
fig, ax = plt.subplots(figsize=(10, 6))
sns.lineplot(
    data=sih_sus_regional,
    x="year", y="rate_icsap_cost",
    hue="big_region_name",
    marker="o",
    ax=ax
)
ax.set_ylabel("ICSAP Cost (%)")
ax.set_xlabel("")
ax.set_xticks(sorted(sih_sus_regional["year"].unique()))
ax.set_xticklabels(sorted(sih_sus_regional["year"].unique()))
ax.legend(title=None, loc="upper center", bbox_to_anchor=(0.5, 1.15), ncol=5)
plt.tight_layout()
plt.show()
(a) ICSAP (2014 to 2024)
(b) ICSAP Costs (2014 to 2024)
Figure 2: Percent of Avoidable Hospitalizations and Costs by Brazilian Macro-Regions

The North consistently shows the highest ICSAP rates — over 10% annually — indicating structural challenges in primary care access. The Northeast follows closely, with high volumes and cost shares, reflecting its large population and uneven healthcare coverage. Meanwhile, the South and Southeast maintain lower ICSAP rates and cost proportions, though the Southeast still incurs the highest absolute costs due to its large population and hospitalization scale. The Centro-Oeste region remains intermediate in both metrics.

ICSAP Metrics by the Last Year (2024)

Focusing on 2024, we analyze ICSAP by region, age group, and gender.

con.sql("""
SELECT COUNT(*)
FROM silver.sih_sus_eda_clean AS sih
LEFT JOIN silver.cid10_icsap AS icsap
    ON sih.disease_code = icsap.avoidable_disease_code
WHERE year = 2024 AND icsap.avoidable_disease_code IS NOT NULL
""").show()
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│      1103170 │
└──────────────┘

As the dataset reduced to 1,103,170 rows (more than 99%) we will query the data into a pandas dataframe to continue with analysis.

# Query the data
sih_icsap_2024 = con.sql("""
SELECT
    uf.big_region_name,
    uf.state_name,
    sih.length_stay,
    sih.gender,
    sih.race_color_code,
    sih.age,
    icsap.disease_category,
    sih.total_cost
FROM silver.sih_sus_eda_clean AS sih
LEFT JOIN silver.cid10_icsap AS icsap
    ON sih.disease_code = icsap.avoidable_disease_code
LEFT JOIN silver.uf_localidade AS uf
    ON sih.state_code = uf.state_code
WHERE sih.year = 2024
    AND sih.total_cost > 0
    AND icsap.avoidable_disease_code IS NOT NULL
""").to_df()

# Data summary
def column_summary(df):
    summary_data = []

    for col_name in df.columns:
        col_dtype = df[col_name].dtype
        num_of_nulls = df[col_name].isnull().sum()
        num_of_non_nulls = df[col_name].notnull().sum()
        num_of_distinct_values = df[col_name].nunique()

        summary_data.append({
            'col_name': col_name,
            'col_dtype': col_dtype,
            'num_of_nulls': num_of_nulls,
            'num_of_non_nulls': num_of_non_nulls,
            'num_of_distinct_values': num_of_distinct_values
        })
    
    summary_df = pd.DataFrame(summary_data)
    return summary_df

summary_df = column_summary(sih_icsap_2024)

print("Summary of the data:")
display(summary_df)

print("------------------------------------------------------------------------")

describe_df = sih_icsap_2024.describe()
print("Descriptive statistic of the data:")
display(describe_df)
Summary of the data:
col_name col_dtype num_of_nulls num_of_non_nulls num_of_distinct_values
0 big_region_name object 0 1103170 5
1 state_name object 0 1103170 27
2 length_stay int16 0 1103170 186
3 gender object 0 1103170 2
4 race_color_code Int16 3 1103167 5
5 age int16 0 1103170 100
6 disease_category object 0 1103170 19
7 total_cost float64 0 1103170 214100
------------------------------------------------------------------------
Descriptive statistic of the data:
length_stay race_color_code age total_cost
count 1.103170e+06 1103167.0 1.103170e+06 1.103170e+06
mean 6.293833e+00 2.321867 5.156380e+01 1.129597e+03
std 7.479414e+00 0.968601 2.800763e+01 2.678282e+03
min 0.000000e+00 1.0 0.000000e+00 2.198000e+01
25% 2.000000e+00 1.0 2.900000e+01 3.280100e+02
50% 4.000000e+00 3.0 5.900000e+01 5.207200e+02
75% 7.000000e+00 3.0 7.400000e+01 8.659100e+02
max 3.430000e+02 5.0 9.900000e+01 1.679245e+05

Overall, the data reveals high heterogeneity in both cost and clinical profiles, with strong variability across hospitalizations.

The dataset has fill missing values (only 3 in race_color_code). It spans five macro-regions, 27 states, and includes 19 disease categories. The total_cost variable shows high variability, with a range from R$21.98 to R$167,924.50. The average hospitalization cost is approximately R$1,129.61, but the standard deviation is more than twice that (R$2,678.28), indicating a highly skewed distribution with probably substantial outliers.

Length_stay also exhibits variability, with a mean of 6 days and a standard deviation of 7 days. While most hospitalizations are short (median of 4 days), some extend up to 343 days, suggesting the presence of complex or chronic cases. The average age is 51 years, and the interquartile range (29 to 74) reflects a concentration of adult and elderly patients.

Thus, the statistic suggest that while most cases are relatively short and moderately costly, a subset of patients drives up the average.

So, we can analyse if this could be related due to expanded stays and or age group.

Total Hospitalizations and Costs

First, make a copy of the dataset and create an age group and race names.

# Make a copy of the data
df = sih_icsap_2024.copy()

#Create an age group
bins = [-1, 4, 14, 24, 44, 64, float('inf')]
labels = ['0-4', '5-14', '15-24', '25-44', '45-64', '65+']
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels, right=True)

#Map race names (IBGE Brazil)
race_names_map = {
    1: "Branca",
    2: "Preta",
    3: "Amarela",
    4: "Parda",
    5: "Indígena"
}

df['race_color'] = df['race_color_code'].map(race_names_map)

df.head(10)
big_region_name state_name length_stay gender race_color_code age disease_category total_cost age_group race_color
0 Sul Paraná 7 Female 1 48 Doenças Cerebrovasculares 1184.66 45-64 Branca
1 Sul Paraná 5 Female 1 96 Infecção no Rim e Trato Urinário 1063.47 65+ Branca
2 Sul Paraná 4 Female 3 54 Doenças Cerebrovasculares 2113.56 45-64 Amarela
3 Nordeste Pernambuco 5 Female 2 55 Doenças Cerebrovasculares 3717.67 45-64 Preta
4 Nordeste Pernambuco 11 Female 3 56 Doenças Cerebrovasculares 1170.17 45-64 Amarela
5 Nordeste Pernambuco 3 Female 3 60 Hipertensão 264.32 45-64 Amarela
6 Nordeste Pernambuco 4 Male 3 4 Gastroenterites Infecciosas e complicações 356.90 0-4 Amarela
7 Nordeste Pernambuco 5 Male 3 51 Úlcera gastrointestinal 870.73 45-64 Amarela
8 Nordeste Pernambuco 6 Female 3 66 Diabetes melitus 4137.38 65+ Amarela
9 Nordeste Pernambuco 9 Female 3 80 Infecção no Rim e Trato Urinário 586.09 65+ Amarela

We can analyse the relationship between lenght of stay and the total cost and visualize the distribution of the age groups by doing a scatter plot.

# Get a sample of the data
sample_df = df.sample(n=100000, random_state=42)
Code
# Create plot
sns.set_theme(style="whitegrid", palette="colorblind")

g = sns.relplot(
    data=sample_df,
    x='length_stay', y='total_cost',
    col='age_group', hue='gender', col_wrap=3,
    kind='scatter', height=4, aspect=1.2, alpha=0.5 
)
sns.move_legend(
    g, "upper center",
    bbox_to_anchor=(.5, 1.0), 
    ncol=2, title=None, frameon=False
)
g.set_axis_labels("Length of Stay (Days)", "Total Cost (R$)")
g.set_titles("Age Group: {col_name}")
plt.tight_layout(rect=[0, 0, 1, 0.97])
plt.show()

# We can filter the data to lenght stays equal or less then 100 days and
# costs equal or less then 60K for better visualization
filtered_df = sample_df[
    (sample_df["length_stay"] <= 100) &
    (sample_df["total_cost"] <= 60000)
]

g = sns.relplot(
    data=filtered_df,
    x='length_stay', y='total_cost',
    col='age_group', hue='gender', col_wrap=3,
    kind='scatter', height=4, aspect=1.2, alpha=0.5 
)
sns.move_legend(
    g, "upper center",
    bbox_to_anchor=(.5, 1.0), 
    ncol=2, title=None, frameon=False
)
g.set_axis_labels("Length of Stay (Days)", "Total Cost (R$)")
g.set_titles("Age Group: {col_name}")
plt.tight_layout(rect=[0, 0, 1, 0.97])
plt.show()
(a) ICSAP Cost vs Length of Stay by Age Group (2024)
(b) ICSAP Cost vs Length of Stay by Age Group (2024) - Filtered
Figure 3: Analysis of the relationship between cost and length of stay

There doesn’t appear to be a clear relationship between length of stay and total cost. We observe both short hospitalizations with high costs and long stays with either high or low costs.

Both age groups are represented across the spectrum, but individuals aged 65+ and 45–64 seem to have a higher concentration of cases with elevated costs and/or extended durations.

The presence of extreme values and wide interquartile ranges suggests that while most hospitalizations are moderately priced, a subset incurs significantly higher costs.

Now, we can analyse the statistic grouped by age group and gender to see the patterms.

summary_gender = df.groupby(["gender"])["total_cost"].describe().round(2)

summary_age = df.groupby(["age_group"])["total_cost"].describe().round(2)

display(summary_gender)
display(summary_age)
count mean std min 25% 50% 75% max
gender
Female 557906.0 1055.83 2548.52 21.98 317.64 487.21 805.72 136618.59
Male 545264.0 1205.07 2802.84 21.98 340.90 574.46 900.97 167924.52
count mean std min 25% 50% 75% max
age_group
0-4 105421.0 781.47 1858.71 40.38 332.90 447.48 625.44 90287.23
5-14 90342.0 732.19 1633.27 40.38 332.90 410.85 622.42 74917.45
15-24 51335.0 813.00 2089.71 40.38 226.68 332.90 582.42 116257.54
25-44 132312.0 1077.99 3317.94 23.45 250.68 421.14 835.46 136618.59
45-64 263406.0 1317.56 2963.40 21.98 347.15 582.41 989.52 136999.63
65+ 460354.0 1229.90 2664.24 21.98 355.15 587.07 927.57 167924.52
Code
# Counting the total of hospitalizations
# By gender
count_gender = df.groupby(["gender"]).size().reset_index(name='total_hospitalizations')
# By age group
count_age_group = df.groupby(["age_group"]).size().reset_index(name='total_hospitalizations')

# Create the plots
sns.set_theme(style="white", palette="colorblind")

# Bar plot for gender
fig, ax = plt.subplots()

sns.barplot(
    data=count_gender,
    x="gender", y="total_hospitalizations",
    hue='gender',
    ax=ax
)
for container in ax.containers:
    ax.bar_label(container, fontsize=9, color='black', fontweight='bold')
ax.spines[['right', 'top']].set_visible(False)
ax.set_ylabel("Total Hospitalizations")
ax.set_xlabel("")
plt.show()

# Bar plot for age group
fig, ax = plt.subplots()

sns.barplot(
    data=count_age_group,
    x="age_group", y="total_hospitalizations",
    hue='age_group',
    ax=ax
)
for container in ax.containers:
    ax.bar_label(container, fontsize=9, color='black', fontweight='bold')
ax.spines[['right', 'top']].set_visible(False)
ax.set_ylabel("Total Hospitalizations")
ax.set_xlabel("")
plt.show()
(a) ICSAP by Gender (2024)
(b) ICSAP by Age Group (2024)
Figure 4: Total Hospitalizations by Gender and Age Group
Code
# Create the plots
sns.set_theme(style="whitegrid", palette="colorblind")

# Gender
fig, ax = plt.subplots()

sns.boxplot(data=sample_df, x='gender', y='total_cost', hue='gender')
ax.set_ylabel("Total Cost (R$)")
ax.set_xlabel("")
plt.show()

# Gender no outliers
fig, ax = plt.subplots()

sns.boxplot(
    data=sample_df, 
    x='gender', y='total_cost', hue='gender',
    showfliers=False # Hide outliers
) 
ax.set_ylabel("Total Cost (R$)")
ax.set_xlabel("")
plt.show()

# Age group
fig, ax = plt.subplots()

sns.boxplot(data=sample_df, x='age_group', y='total_cost', hue='age_group')
ax.set_ylabel("Total Cost (R$)")
ax.set_xlabel("")
plt.show()

# Age group filtered
fig, ax = plt.subplots()

sns.boxplot(
    data=sample_df,
     x='age_group', y='total_cost', hue='age_group',
     showfliers=False # Hide outliers
)
ax.set_ylabel("Total Cost (R$)")
ax.set_xlabel("")
plt.show()
(a) ICSAP Cost by Gender (2024)
(b) ICSAP Cost by Gender (2024) - Without outs
(c) ICSAP Cost by Age Group (2024)
(d) ICSAP Cost by Age Group (2024) - Without outs
Figure 5: Total Cost by Gender and Age Group

For gender, males have a higher average cost (R$1,205.07) compared to females (R$1,055.83), and also a higher standard deviation (R$2,802.84 vs. R$2,548.52), indicating greater dispersion in male hospitalization expenses. Both genders share the same minimum cost (R$21.98), but males reach a higher maximum (R$167,924.52), suggesting the presence of extreme outliers.

When analyzing age groups, hospitalization costs tend to increase with age. Children aged 0–4 and 5–14 have the lowest average costs, while adults aged 45–64 and seniors 65+ show the highest averages. The standard deviation also grows with age, reflecting more variability in older populations.

And for races? Is there a difference?

Code
# Counting the total of hospitalizations by race
count_race = (df.groupby(["race_color"])
                .size()
                .reset_index(name='total_hospitalizations')
                .sort_values(by='total_hospitalizations', ascending=False)
            )

# Bar plot
sns.set_theme(style="white", palette="colorblind")

fig, ax = plt.subplots()

sns.barplot(
    data=count_race,
    x="race_color", y="total_hospitalizations",
    hue='race_color',
    ax=ax
)
for container in ax.containers:
    ax.bar_label(container, fontsize=9, color='black', fontweight='bold')
ax.spines[['right', 'top']].set_visible(False)
ax.set_ylabel("Total Hospitalizations")
ax.set_xlabel("")
plt.show()
Figure 6: Total ICSAP Hospitalizations (2024) by race

We also can have a look in those patterms by macro-regions wiht the metrics agrouped by gender and age group.

# Prepare data
# Metrics agrouoed by gender
df_region_gender = (df.groupby(['big_region_name', 'gender'])
                      .size()
                      .reset_index(name='total_hospitalizations')
                      .sort_values(by='total_hospitalizations', ascending=False)
)

df_region_gender_cost = (df.groupby(['big_region_name', 'gender'])
                           .agg(total_cost=('total_cost', 'sum'))
                           .reset_index()
)

# Metrics agrouped by age group
df_region_age = (df.groupby(['big_region_name', 'age_group'])
                   .size()
                   .reset_index(name='total_hospitalizations')
)

df_region_age_cost = (df.groupby(['big_region_name', 'age_group'])
                        .agg(total_cost=('total_cost', 'sum'))
                        .reset_index()
)

# Metrics agrouped by race
df_region_race = (df.groupby(['big_region_name', 'race_color'])
                    .size()
                    .reset_index(name='total_hospitalizations')
)
Code
# Define the order
region_order = df_region_gender['big_region_name']

# Create plot
sns.set_theme(style="whitegrid", palette="colorblind")
# For gender count
fig, ax = plt.subplots(figsize=(10, 6))

sns.barplot(
    data=df_region_gender,
    x='big_region_name', y='total_hospitalizations',
    hue='gender',
    ax=ax, order=region_order
)
for container in ax.containers:
    ax.bar_label(container, fontsize=9, color='black', fontweight='bold')
ax.spines[['right', 'top', 'left']].set_visible(False)
ax.legend(title=None, loc="upper center", bbox_to_anchor=(0.5, 1.15), ncol=2)
ax.yaxis.set_visible(False)
ax.set_xlabel("")
plt.show()

# For gender cost
fig, ax = plt.subplots(figsize=(10, 6))

sns.barplot(
    data=df_region_gender_cost,
    x='big_region_name', y='total_cost', hue='gender',
    ax=ax, order=region_order, width=0.9
)
for container in ax.containers:
    values = container.datavalues
    labels = [format_brl(v) for v in values]
    ax.bar_label(container, labels=labels, fontsize=9, color='black', fontweight='bold')
ax.spines[['right', 'top', 'left']].set_visible(False)
ax.legend(title=None, loc="upper center", bbox_to_anchor=(0.5, 1.15), ncol=2)
ax.set_ylabel("")
ax.yaxis.set_visible(False)
ax.set_xlabel("")
plt.show()
(a) ICSAP Hospitalizations (2024)
(b) ICSAP Cost (2024)
Figure 7: Total Hospitalizations and Costs by Macro-Regions and Gender
Code
# Define the order
region_order = ['Sudeste', 'Nordeste', 'Sul', 'Norte', 'Centro-Oeste']

age_group_order = ['65+', '45-64', '25-44', '0-4', '5-14', '15-24']

# Create plot
sns.set_theme(style="white", palette="colorblind")

# For age group count
fig, ax = plt.subplots(figsize=(10, 8))

sns.barplot(
    data=df_region_age,
    x='total_hospitalizations', y='big_region_name',
    orient='h', hue='age_group', hue_order=age_group_order,
    ax=ax, order=region_order
)
for container in ax.containers:
    ax.bar_label(container, fontsize=9, color='black', fontweight='bold')
ax.spines[['right', 'top', 'bottom']].set_visible(False)
ax.legend(title=None, loc="upper center", bbox_to_anchor=(0.5, 1.15), ncol=6)
ax.xaxis.set_visible(False)
ax.set_ylabel("")
plt.show()

# For age group cost
fig, ax = plt.subplots(figsize=(10, 8))

sns.barplot(
    data=df_region_age_cost,
    x='total_cost', y='big_region_name',
    orient='h', hue='age_group', hue_order=age_group_order,
    ax=ax, order=region_order
)
for container in ax.containers:
    values = container.datavalues
    labels = [format_brl(v) for v in values]
    ax.bar_label(container, labels=labels, fontsize=9, color='black', fontweight='bold')
ax.spines[['right', 'top', 'bottom']].set_visible(False)
ax.legend(title=None, loc="upper center", bbox_to_anchor=(0.5, 1.15), ncol=6)
ax.xaxis.set_visible(False)
ax.set_ylabel("")
plt.show()
(a) ICSAP Hospitalizations (2024)
(b) ICSAP Cost (2024)
Figure 8: Total Hospitalizations and Costs by Macro-Regions and Age Group
Code
# Define the order
region_order = ['Sudeste', 'Nordeste', 'Sul', 'Norte', 'Centro-Oeste']

race_group_order = df_region_race['total_hospitalizations']

# Create plot
sns.set_theme(style="white", palette="colorblind")
# For gender count
fig, ax = plt.subplots(figsize=(10, 8))

sns.barplot(
    data=df_region_race,
    x='total_hospitalizations', y='big_region_name',
    orient='h', hue='race_color', 
    ax=ax, order=region_order
)
for container in ax.containers:
    ax.bar_label(container, fontsize=9, color='black', fontweight='bold')
ax.spines[['right', 'top', 'bottom']].set_visible(False)
ax.legend(title=None, loc="upper center", bbox_to_anchor=(0.5, 1.15), ncol=5)
ax.xaxis.set_visible(False)
ax.set_ylabel("")
plt.show()
Figure 9: Total ICSAP Hospitalization (2024) by Race/Color

Across Brazil’s macro regions, hospitalization patterns reveal consistent gender and age-related trends. The Southeast region recorded the highest number of hospitalizations overall, with female patients slightly outnumbering males.

Older adults (65+) accounted for the largest share of hospitalizations in every region, particularly in the Southeast and Northeast. The 45–64 age group followed closely.

Total costs mirrored these patterns. The Southeast incurred the highest total hospitalization costs for both genders. When broken down by age group, the 65+ population consistently drove the highest costs across all regions, peaking in the Southeast at over R$259 million. The 45–64 group also contributed substantially, reinforcing the financial impact of aging populations on the healthcare system.

ICSAP Disease Categories

Let’s look at the avoidable disease categories, the main categories and their expenditures.

disease_category = (
    df.groupby('disease_category')
      .agg(total_hospitalizations=('disease_category', 'size'),
           total_cost=('total_cost', 'sum'),
           avg_cost=('total_cost', 'mean'),
           avg_stay=('length_stay', 'mean'))
      .sort_values(by='total_hospitalizations', ascending=False)
)
styled_tb = (
    disease_category.style
    .background_gradient(cmap='YlGnBu', high=0.2, subset=['total_cost', 'avg_cost', 'avg_stay'])
    .format({'total_cost': 'R$ {:,.2f}'})
)

styled_tb
  total_hospitalizations total_cost avg_cost avg_stay
disease_category        
Infecção no Rim e Trato Urinário 218241 R$ 126,877,175.19 581.362692 5.674319
Doenças Cerebrovasculares 197963 R$ 365,759,200.23 1847.613949 7.290999
Pneumonias bacterianas 168655 R$ 243,172,884.92 1441.836204 6.828223
Diabetes melitus 136172 R$ 151,124,585.96 1109.806612 6.779984
Gastroenterites Infecciosas e complicações 126223 R$ 51,377,873.76 407.040506 3.333283
Infecção da pele e tecido subcutâneo 61046 R$ 52,273,362.37 856.294636 7.341382
Úlcera gastrointestinal 50844 R$ 60,521,054.37 1190.328345 5.599599
Hipertensão 37068 R$ 18,044,255.16 486.787935 4.222132
Insuficiência Cardíaca 26020 R$ 71,171,979.81 2735.279777 8.264374
Doenças preveníveis por imunização e condições sensíveis 23839 R$ 46,300,850.19 1942.231226 13.726541
Doencas pulmonares 22406 R$ 20,922,481.99 933.789252 4.713470
Deficiências Nutricionais 19316 R$ 24,227,126.98 1254.251759 8.666960
Asma 6492 R$ 4,849,799.21 747.042392 3.428990
Infecções de ouvido, nariz e garganta 2239 R$ 670,022.02 299.250567 2.887896
Doenças relacionadas ao Pré-Natal e Parto 1827 R$ 1,143,375.44 625.821259 6.281883
Eplepsias 1485 R$ 1,473,095.85 991.983737 5.622896
Angina 1132 R$ 5,301,099.39 4682.949991 4.873675
Anemia 1121 R$ 446,850.07 398.617368 3.992864
Doença Inflamatória órgãos pélvicos femininos 1081 R$ 480,510.46 444.505513 2.449584

While urinary tract infections top the list in volume, cerebrovascular diseases dominate in total cost, exceeding R$365 million despite fewer admissions. Bacterial pneumonia and diabetes mellitus also represent high-burden conditions, combining frequent hospitalizations with substantial expenditures.

Interestingly, some conditions—such as heart failure and angina—stand out for their high cost per hospitalization. Heart failure, with just over 26,000 admissions, incurred roughly R$2,000 per hospitalization, while angina, with only 1,132 admissions, exceeded R$4,000 per hospitalization.

Immunization-preventable conditions still account for over 23,000 hospitalizations and R$46 million in costs — highlighting gaps in preventive coverage.

# Identify the 3 categories with the most hospitalizations
top3_categories = df['disease_category'].value_counts().nlargest(3).index.tolist()

# Filter the data
filter_top3 = df['disease_category'].isin(top3_categories) 
df_top3 = df[filter_top3].copy()
Code
# Create plot
sns.set_theme(style="whitegrid", palette="colorblind")

# For total cost
c = sns.catplot(
    data=df_top3,
    x='age_group', y='total_cost', hue='age_group',
    col='disease_category',
    kind='box', col_wrap=3, 
    showfliers=False # Hide outliers
)
c.set_axis_labels("", "Total Cost (R$)")
c.set_titles("{col_name}")
plt.show()

# For length stay
s = sns.catplot(
    data=df_top3,
    x='age_group', y='length_stay', hue='age_group',
    col='disease_category',
    kind='box', col_wrap=3,
    showfliers=False # Hide outliers
)
s.set_axis_labels("", "Length of Stay (Days)")
s.set_titles("{col_name}")
plt.show()
(a) ICSAP Total Cost Distribution (2024)
(b) ICSAP Length of Stay Distribution (2024)
Figure 10: Total Cost and Length of Stay Distribution by Age Group
# Total hosp and cost by macro-regions
treemap_data = df.groupby(['big_region_name', 'disease_category']).agg(
    total_hospitalizations=('disease_category', 'size'),
    total_cost=('total_cost', 'sum')
).sort_values('total_hospitalizations', ascending=False).reset_index()

treemap_top5 = treemap_data.groupby('big_region_name').head(5)

# Get Brazil top disease category
top_category_brasil = df['disease_category'].value_counts().idxmax()
br_top_hosp = (df['disease_category'] == top_category_brasil).sum()
br_top_cost = df.loc[df['disease_category'] == top_category_brasil, 'total_cost'].sum()

root_label = f"Brazil - 2024"
Code
# Create plot
# Treemap plot
fig_treemap = px.treemap(
    treemap_top5,
    path=[px.Constant(root_label), 'big_region_name', 'disease_category'],
    values='total_hospitalizations',
    color='total_cost',
    color_continuous_scale='YlGnBu',
    custom_data=['total_cost'],
    labels={'total_cost': 'Total Cost'}
)
default_hover = (
    "<b>%{label}</b><br>"
    "Total Hospitalizations: %{value:,d}<br>"
    "Total Cost: R$ %{customdata[0]:,.2f}<extra></extra>"
)
fig_treemap.update_traces(hovertemplate=default_hover)

labels = list(fig_treemap.data[0]['labels'])
root_idx = labels.index(root_label)  # o índice do nó raiz
hovertemplates = [default_hover] * len(labels)
hovertemplates[root_idx] = (
    f"<b>{root_label}</b><br>"
    f"Brazil Top Disease Category: <b>{top_category_brasil}</b><br>"
    f"Total Hospitalizations: {br_top_hosp:,d}<br>"
    f"Total Cost: R$ {br_top_cost:,.2f}<extra></extra>"
)
fig_treemap.data[0].hovertemplate = hovertemplates
fig_treemap.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig_treemap.show()
Figure 11: Top 5 ICSAP Disaese Categories by Macro-Regions (2024)

Conclusions

About the dataset:

  • Dataset size and structure: The SIH-SUS dataset is very large; therefore, it is more efficient to run queries that aggregate information before importing the data into Power BI.

  • Data quality issues:

    • Missing data is minimal, but there are records with zero cost that need filtering for financial analyses.
    • Both cost and length of stay show high variability and skewness, indicating that averages can be influenced by a small subset of complex cases.
    • There are duplicated records by month that need to be handle before ML models.

About the data:

  • Gradual reduction in ICSAP: The proportion of hospitalizations for primary care–sensitive conditions in the SUS decreased between 2014 and 2024 (from 10.6% to 8.4%), suggesting some improvements, though the absolute number of cases remained stable.

  • Significant financial impact: Despite the proportional reduction, absolute costs increased, surpassing R$1.2 billion in 2024, indicating that remaining cases are potentially more complex and costly.

  • Marked regional differences:

    • The North and Northeast maintain higher ICSAP rates, reflecting possible gaps in primary care access and coverage.
    • The Southeast accounts for the highest absolute costs and hospitalizations due to its larger population.
  • Age and gender profiles: Older adults (45–64 and 65+) account for most hospitalizations and costs, reinforcing the impact of population aging.

  • Costs and length of stay: Maybe there is no direct relationship between days of hospitalization and cost, as expenses may depend on several factors beyond duration, including the type and complexity of treatment (e.g., surgeries, ICU), medications and supplies, comorbidities and others. Thus, short stays can be expensive, while some long stays may involve relatively low costs.

  • Main disease categories:

    • Urinary tract infections lead in admission volume.
    • Cerebrovascular diseases generate the highest total costs.
    • Bacterial pneumonia, diabetes mellitus, and heart failure also have a significant impact.
    • Vaccine-preventable conditions still account for over 23,000 admissions, highlighting gaps in preventive coverage.